Automated Evaluation of Essays and Short Answers

Authors

  • Jill Burstein
  • Claudia Leacock
  • Richard Swartz
Abstract

Essay questions designed to measure writing ability, along with open-ended questions requiring short answers, are highly valued components of effective assessment programs, but the expense and logistics of scoring them reliably often present a barrier to their use. Extensive research and development efforts in natural language processing at Educational Testing Service (ETS) over the past several years (see http://www.ets.org/research/erater.html) have produced two applications with the potential to dramatically reduce the difficulties associated with scoring these types of assessments. The first of these, e-rater™, is a software application designed to produce holistic scores for essays based on the features of effective writing that faculty readers typically use: organization, sentence structure, and content. The e-rater software is "trained" with sets of essays scored by faculty readers so that it can accurately "predict" the holistic score a reader would give to an essay. ETS implemented e-rater as part of the operational scoring process for the Graduate Management Admissions Test (GMAT) in 1999. Since then, over 750,000 GMAT essays have been scored, with e-rater and reader agreement rates consistently above 97%. The e-rater scoring capability is now available for use by institutions via the Internet through the Criterion Online Writing Evaluation service at http://www.etstechnologies.com/criterion. The service is being used for both instruction and assessment by middle schools, high schools, and colleges in the U.S. ETS Technologies is also conducting research that explores the feasibility of automated scoring of short-answer content-based responses, such as those based on questions that appear in a textbook's chapter review section. If successful, this research has the potential to evolve into an automated scoring application appropriate for evaluating short-answer constructed responses in online instruction and assessment applications in virtually all disciplines.

E-rater History and Design

Educational Testing Service (ETS) has pursued research in writing assessment since its founding in 1947. ETS administered the Naval Academy English Examination and the Foreign Service Examination as early as 1948 (ETS Annual Report, 1949-50), and the Advanced Placement (AP) essay exam was administered in the spring of 1956. Some of the earliest research in writing assessment (see Coward, 1950, and Huddleston, 1952) laid the foundation for holistic scoring, which continues to be used by ETS for large-scale writing assessments. Currently, several large-scale assessment programs contain a writing measure: the Graduate Management Admissions Test (GMAT), the Test of English as a Foreign Language (TOEFL), the Graduate Record Examination (GRE), Professional Assessments for Beginning Teachers (PRAXIS), the College Board's SAT II Writing Test and Advanced Placement (AP) exam, and the College-Level Examination Program (CLEP) English and Writing Tests. Some of these tests have moved to computer-based delivery, including the GMAT Analytical Writing Assessment (AWA), TOEFL, and GRE. The migration to computer-based delivery of these tests, along with the collection of examinee essay data in digital form, has permitted the exploration and use of automated methods for generating essay scores. In February 1999, ETS began to use e-rater for operational scoring of the GMAT AWA (see Burstein et al. and Kukich, 2000). The GMAT AWA has two test question types (prompts): the issue prompt and the argument prompt.
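The abstract above describes e-rater as being "trained" on essays scored by faculty readers so that it can "predict" the holistic score a reader would assign. As a rough illustration of that kind of supervised setup, the sketch below extracts a few toy surface features and fits a least-squares model to reader scores; the features, model, and function names are hypothetical stand-ins, not e-rater's actual design or code.

```python
# Minimal sketch of training a holistic-scoring model on reader-scored essays.
# The features and model below are hypothetical placeholders, not e-rater's own.
import numpy as np

def extract_features(essay: str) -> list:
    """Toy surface features standing in for e-rater's discourse,
    syntactic, and vocabulary analyses."""
    words = essay.split()
    sentences = [s for s in essay.split(".") if s.strip()]
    return [
        float(len(words)),                        # essay length
        float(len({w.lower() for w in words})),   # vocabulary size
        len(words) / max(len(sentences), 1),      # average sentence length
    ]

def train_scoring_model(essays, reader_scores):
    """Fit a least-squares linear model mapping features to 1-6 holistic scores."""
    X = np.array([extract_features(e) + [1.0] for e in essays])  # add intercept term
    y = np.array(reader_scores, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def predict_score(essay, weights):
    """Predict a holistic score, rounded and clipped to the six-point scale."""
    x = np.array(extract_features(essay) + [1.0])
    return int(np.clip(round(float(x @ weights)), 1, 6))
```

The actual system relies on the discourse, syntactic, and topical-analysis features described in the following sections rather than these toy measures.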
Prior to the use of e-rater, both the paper-and-pencil and initial computer-based versions of the GMAT AWA were scored by two human readers on a six-point holistic scale. A final score was assigned to an essay based on the original two reader scores if those scores differed by no more than one point; if the two readers were discrepant by more than one point, a third reader was introduced to resolve the final score. Since February 1999, each essay has been assigned a score by e-rater and by one human reader. Using the GMAT score-resolution procedures for two human readers, if the e-rater and human reader scores differ by more than one point, a second human reader resolves the discrepancy; otherwise, if the e-rater and human reader scores agree within one point, those two scores are used to compute the final score for the essay. Since e-rater became operational for GMAT AWA scoring, it has scored over 750,000 essays, approximately 375,000 essays per year. The reported discrepancy rate between e-rater and one human reader has been less than three percent, which is comparable to the discrepancy rate between two human readers.

E-rater Design and Holistic Scoring

Holistic essay scoring has been researched since the 1960s (Godshalk, 1966) and departs from the traditional, analytical system of teaching and evaluating writing. In the holistic scoring approach, readers are told to read quickly for a total impression and to take into account all aspects of writing as specified in the scoring guide. The final score is based on the reader's total impression (Conlan, 1980). From e-rater's inception, it has been a goal that the features used by the system to assign an essay score be related to the features in the holistic scoring guide. Generally speaking, the scoring guide indicates that an essay that stays on the topic of the question, has a strong, coherent, and well-organized argument structure, and displays a variety of word use and syntactic structure will receive a score at the higher end of the six-point scale (5 or 6). E-rater's features include discourse structure, syntactic structure, and analysis of vocabulary usage (topical analysis).

Natural Language Processing (NLP) in E-rater

Natural language processing (NLP) is the application of computational methods to analyze characteristics of electronic files of text or speech. In this section, only text-based applications are discussed. The methods used are either statistical or linguistically based analyses of language features. NLP applications use tools such as syntactic parsers to analyze the syntactic form of a text (Abney, 1996), discourse parsers to analyze the discourse structure of a text (Marcu, 2000), and lexical similarity measures to analyze the word use of a text (Salton, 1989).
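As an illustration of the lexical similarity measures mentioned above, the sketch below compares an essay's word-frequency vector with vocabulary pooled from training essays at each score point, using cosine similarity. The grouping by score point and the helper names are assumptions made for the example, not a description of e-rater's actual topical-analysis implementation.

```python
import math
from collections import Counter

def word_vector(text: str) -> Counter:
    """Bag-of-words frequency vector for a text (no stop-word removal here)."""
    return Counter(text.lower().split())

def cosine_similarity(vec_a: Counter, vec_b: Counter) -> float:
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def topical_score(essay: str, essays_by_score: dict) -> int:
    """Assign the score point whose pooled training vocabulary the essay
    most closely resembles (essays_by_score maps score -> list of training essays)."""
    essay_vec = word_vector(essay)
    similarities = {
        score: cosine_similarity(essay_vec, word_vector(" ".join(texts)))
        for score, texts in essays_by_score.items()
    }
    return max(similarities, key=similarities.get)
```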


Similar articles

Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures

Automatic short answer grading (ASAG) is the automated process of assessing natural-language answers using computational methods and machine learning algorithms. The development of large-scale smart education systems, on the one hand, and the importance of assessment as a key factor in the learning process and the challenges it faces, on the other, have significantly increased the need for ...


Towards an automated system for short-answer assessment using ontology mapping

A key concern for any e-assessment tool (computer-assisted assessment) is its efficiency in assessing the learner's knowledge, skill set, and ability. Multiple-choice questions are the most common, and also a successful, means of assessment used in e-assessment systems. An efficient e-assessment system should use a variety of question types, including short answers, essays, etc., and modes of respons...


Automated Scoring of Handwritten Essays Based on Latent Semantic Analysis

Handwritten essays are widely used in educational assessments, particularly in classroom instruction. This paper concerns the design of an automated system that takes as input scanned images of handwritten student essays in reading comprehension tests and produces as output scores for the answers analogous to those provided by human scorers. The system is base...


A Hybrid Method of Syntactic Feature and Latent Semantic Analysis for Automatic Arabic Essay Scoring

Background: Automated essay assessment is a challenging task due to the need for comprehensive evaluation in order to validate the answers accurately. The challenge increases when dealing with the Arabic language, whose morphology, semantics, and syntax are complex. Methodology: Few research efforts have been proposed for Automatic Essay Scoring (AES) in Arabic. However,...


Considering Misconceptions in Automatic Essay Scoring with A-TEST - Amrita Test Evaluation and Scoring Tool

In large classrooms with limited teacher time, there is a need for automatic evaluation of text answers and real-time personalized feedback during the learning process. In this paper, we discuss Amrita Test Evaluation & Scoring Tool (A-TEST), a text evaluation and scoring tool that learns from course materials and from human-rater scored text answers and also directly from teacher input. We use...



Publication year: 2001